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The massive network traffic data between connected devices in the internet 
of things have taken a big challenge to many traditional intrusion detection 
systems (IDS) to find probable security breaches. However, security attacks 
lean towards unpredictability. There are numerous difficulties to build up 
adaptable and powerful IDS for IoT in order to avoid false alerts and ensure a 
high recognition precision against attacks, especially with the rising of 
Botnet attacks. These attacks can even make harmless devices becoming 


Keywords: zombies that send malicious traffic and disturb the network. In this paper, we 
Bot-loT propose a new IDS solution, baptized BotIDS, based on deep learning 
os convolutional neural networks (CNN). The main interest of this work is to 
Botnet design, implement and test our IDS against some well-known Botnet attacks 
CNN using a specific Bot-IoT dataset. Compared to other deep learning 
DL techniques, such as simple RNN, LSTM and GRU, the obtained results of 
IDS our BotIDS are promising with 99.94% in validation accuracy, 0.58% in 
IoT validation loss, and the prediction execution time is less than 0.34 ms. 
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1. INTRODUCTION 

Nowadays an enormous number of objects are dispatched around the world and are connected 
between them and to the internet. They vary from personal gadgets, wearables, sensors, actuators to home 
appliances and medical devices. As estimated by CISCO in 2025, it would be somewhere 75 billion devices 
connected to the Internet [1]. The IoT has raised concerns that are growing rapidly without fitting thought of 
the significant security challenges [2]. 

Nowadays, most of the security concerns are like those of regular servers, workstations and 
smartphones; however, security moves extraordinarily to the IoT, including mechanical security controls, 
hybrid frameworks, IoT-explicit business procedures, and edge devices [3]. Traditional security protection 
technology is limited due Zero-Day attacks and vulnerabilities and future new attacks that are continuously 
changing nature; setting up a steady, reliable, and precise intrusion detection is becoming mandatory for 
improving the IoT security [4]. 

Botnets or zombies are robots of infected internet-connected devices, they are used to achieve 
distributed denial-of-service attack (DDoS attack), password cracking, key logging (steal data), 
cryptocurrency mining, and give the attacker the possibility of accessing the device using command and 
control (C&C) software [5]. In September 2016, “Mirai” malware, an IoT Botnet attacked many sites offline 
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like the cloud service provider “OVH” with nearly 1.1 TBps, the website of computer security consultant 
Brian Krebs with 620 Gbps of traffic, and many other websites like dynamic DNS provider “Dyn” [6]. 

The detection and prevention from different attacks are a big challenge. IDS using machine-learning 
methods, has gained a wide reputation [7]. IDS is an essential component in the security mechanism, it is 
used for the analysis and detection of the security breaches on a network [8]. IDS systems can be gathered 
into two categories: the first one “anomaly detection” and the second is “Misuse detection”, or gathered into 
three major distinct families “host-based”, “network-based” and “hybrid”. IDS investigate both traffic in the 
network and in the operating systems. IDS are used for effective network protection. Numerous research 
works are trying to apply data mining and machine learning algorithms to cyber security. In Machine 
learning, patterns recognition and data mining algorithms are extensively applied to distinguish the normal 
traffic from the malicious one. 


2. ARTIFICIAL NEURAL NETWORKS (ANN) 

In last recent years, one of the most resulting and efficient subsets of artificial intelligent is deep 
learning (DL) which it is also a subset of machine learning based on artificial neural networks (ANN) [9]; a 
computing system inspired from biological brain where the machine learns from many training examples, 
allowing it to classify other examples [10]. DL is increasingly being used. It can be applied in many data 
processing layers into a hierarchical architecture to make a deep model. DL accounts on its capacity to 
identify ideal features in raw data through successive nonlinear transformations, with every alteration 
achieving a more elevated level of complexity and abstraction [10]. It has been applied efficiently to many 
different research fields, from medical image processing, natural language processing, speech recognition, 
and signal recognition to many other domains of science, business and government. In all these fields, DL 
showed tremendously promising results [11]. In the IoT security field, the machine trains on various 
collected and labeled attacks and also normal traffic to learn them, were finally this machine can identify new 
similar attacks. 

Convolutional neural networks (CNN or ConvNets): it’s a deep learning class developed in 1998 by 
LeCun in the LeNet architecture [12]. In recent two decades, CNN gained big success. It’s composed of an 
input layer, many hidden layers in between, and an output layer as shown in Figure 1 like the multi layer 
perceptron (MLP) [13] networks. Best known and used layers are: convolution, activation or ReLU, and 
pooling [14]. The convolutional layer is the most important one, it takes a convolution kernel also called a 
mask or a filter then pass it over the data (usually images) and transform it based on the values from the filter 
as shown in Figure 1. Then it calculates the feature map values using the formula (1), where the input data is 
represented by f and the kernel by h, m and n are respectively the indexes of rows and columns of the 
resultant matrix [15]. 

The pooling layer it is what achieves progressively down sampling to reduce the size of the 
succeeding layers through max pooling or average pooling to help overfitting. Max pooling divides the input 
into non-overlapping clusters and selects the maximum value for each cluster in the previous layer [16]. 


G[m,n] = (f *h)[mn] = YL hl, kl fl[m — jn — k] (1) 
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Figure 1. CNN layers 
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Contrariwise to the traditional feature selection algorithms it has the capability of learning better 
features automatically and categorize the traffic. In addition, it can achieve better classification and learn 
additional features with more traffic data because it shares the same convolution matrix (kernel), that would 
decrease the number of parameters and calculation sum of training significantly. This gives CNN a fast 
recognition of attack nature, contrariwise to other deep-learning algorithms, or machine learning algorithms 
that can be over-fitted with massive big data. Moreover, the literature shows that using CNNs in intrusion 
detection field gives better results than other algorithms [17-18]. 

Recurrent neural networks (RNN) are a class of deep neural networks that contains feedback 
connections as shown in Figure 2. The fully connected layer works on a flattened input where each of these 
inputs is connected to all neurons. The activation function of a node describes its output given a one or set of 
inputs. Rectified linear unit (ReLU) activation as shown in Figure 3(a), it is a ReLU used on all elements of 
the volume. It aims at introducing non-linearities to the network. LSTM is composed of memory blocks that 
are a set of recurrent connected subnetworks. These blocks are composed with a self-connected memory cells 
(one or many) as shown in Figure 4 which offer a memory to remember the previous data, and three units 
called gates: input gate (3.a), a forget gate (3.b) and an output gate (3.c) which they provide a continuous 
equivalent of write, read and reset operations [21]. These gates are sigmoid as shown in Figure 3(b) and tanh 
as shown in Figure 3(c) activation functions meaning that their output is a value between 0 and 1 for sigmoid, 
and between -1 and 1 for tanh. Derived from feedforward neural networks (FNN) but unlike FNN, there are 
loops (bidirectional data flow) and memories to remember previous computations as shown in Figure 4 [19]. 
And allowing preceding outputs to be used as inputs while having hidden states [20] where for each timestep 
t , the activation a<** and the output y<‘*are expressed: 


as? = gy(Wagas')? + Wax + bg) (2.4) 
and ys? = 92(W,,as™ + by) (2.b) 


Where Wax, Waa, Wya, Da, by are coefficients that are shared temporally and g;, gz activation 
functions 


(2) 
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Figure 2. Recurrent neural networks layers 


RNN can face the long-term dependency problem and the vanishing gradient & exploding gradient, 
so we cite here the best known RNN networks, the long short-term memory (LSTM) and gated recurrent unit 
(GRU) networks to solve these problems. The main difference to simple RNN is that the nonlinear units in 
the hidden layers are replaced by memory blocks. The following formula (3) represents the gates in LSTM 


[22]. 
ip = o(@;[Np_1,X;,] + B; (3.8) 
fe = o(we[he_1, X¢] + by (3.b) 
Op = O(Wo[he_1, Xz] + by (3.c) 


O99 
l 


(3) 
Where: x; are the input gates (“i” for the input gate, “f” for the forget, and “o” for the output gate); 


“o it is the sigmoid function; 
“b;” it is the biases for the gate(x); 
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“hy_1” it is the output of the precedent LSTM block; 

“x,” it is the current input. 

Gated recurrent unit (GRU) is a simplified version of LSTM, where the GRU modulates the flow 
of information inside the unit using gating units as shown in Figure 4, without separating the memory cells 
[23-24]. It merges the forget and the input gates into an “update gate”, also merges cell and hidden state, 
GRU has fewer parameters than the LSTM. It is defined by the following formulas. 


E _ sigm(WyyXt F Wrrhe—-1 + b,) (4.a) Ze = sigm(Wy2X¢ + Wrzhe-1 + bz) er 


h, = tanh(W,),X,; + Wrn(% © hy-1) + by) (40) hy = 2,OK4-1 + (1 + z,)Oh, (4.d) (4) 
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Figure 3. Activation functions: (a) ReLU, (b) Sigmoid, (c) Tanh 
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Figure 4. RNN, LSTM & GRU blocks [25] 


3. RELATED WORKS 

Koroniotis et al. [26] applied support vector machine (SVM), LSTM and RNN to evaluate the IoT- 
Dataset. The authors focused on binary classification on the dataset, and their prediction result was either a 
“normal traffic” or “some type of attacks” (for every type of attack), which is not helpful for implementing 
many models (for every attack type) to a working IDS contrary to a multi-label output (numerous categories 
of attacks) that gives one and only one model. 

Ibitoye et al. [27] in their research compared the performance between two deep learning models 
self normalizing networks (SNN) and feed forward neural networks (FNN) inside the milieu of an IoT 
network. This comparison shows that FNN outperforms SNN, even if SNN remains better in regards to 
adversarial samples. Also, the authors examined the impact of feature normalization on the adversarial 
strength and demonstrated its bad influence to adversarial attacks resisting. 

Ferrag et al. [28] in his paper conducted a comparative study with two datasets, Bot-loT and CSE- 
CIC-IDS2018 datasets using some deep learning approaches, such RNN, CNN, Boltzmann machine, deep 
belief networks (DBN), deep Boltzmann machines (DBM), deep autoencoders and deep discriminative 
models, with 100 hidden layers to get an accuracy of 98.394%. 

Mengmeng [29] proposed using FNN an intelligent binary and multiclass classification, but with 
just few classes to get 99% in all evaluation measures (accuracy, precision, recall and FI score) for 
DDoS/DoS attacks while the normal traffic classification got an accuracy of 98%. 

AlKadi [30] proposed a system named mixture localization-based outliers (MLO) on the BoT-IoT 
Dataset that uses utilizes gaussian-mixture models for fitting network data and a local outlier factor function 
for discovering abnormal patterns in network traffic data, and gotten an accuracy of 97.98%. 
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In fact, convolutional neural networks (CNN or ConvNets) are a class of deep neural networks that 
are used in many fields but mostly in pattern recognition. CNN is a class of neural networks that uses the 
convolution and the pooling layers instead of the fully connected hidden layers [31]. Contrariwise to the 
traditional feature selection algorithms it has the capability of learning better features automatically and 
categorize the traffic. In addition, it can achieve better classification and learn additional features with more 
traffic data because it shares the same convolution matrix (mask), that would decrease the number of 
parameters and calculation sum of training significantly. This gives CNN a fast recognition of attack nature, 
contrariwise to other deep-learning algorithms, or machine learning algorithms that can be over-fitted with 
massive big data. Moreover, the literature shows that using CNNs in intrusion detection field gives better 
results than other algorithms [17-18]. 

The related work listed above can provide a good prediction of botnet attacks that can affect an IoT 
system. However, these works could not recognize type of attack due to the binary classification. Hence our 
study constitutes an important experimental extension of the above-mentioned works by benchmarking the 
Bot-IoT dataset using CNN compared to different deep learning models, using the multilabel classification 
corresponding to various categories of attacks in the IoT. 


4. PROPOSED METHOD 

BotIDS is our proposed network IDS obtained by learning from the Bot-IoT dataset. This solution is 
planned to be placed in a fog node when it will be implemented in a real IoT environment. A such 
deployment gives it the power of analyzing in real time the inbound and outbound traffic through the network 
by sniffing it as shown in Figure 5. This location will make our BotIDS able to monitor all traffic to/from the 
devices both inside and outside the network, and even particularly the traffic between this insider devices in 
case of a lurking zombie device inside the network. 
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Figure 5. Architecture of proposed approach 


The BotIDS is a deep learning-based method on a deep learning model that contains three phases: as 
shown in Figure 6 
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Figure 6. Model building process 


— Ist Phase: Dataset preprocessing; First of all, we need to alter the raw data and normalize its values with 
the goal of the best performance of deep learning model, and then convert it into image shape. 

— 2nd Phase: Building the model; the model is at first fit on a training dataset (a part of the dataset) using 
parameters to achieve the improvement of the model performance, these parameters are changed in the 
training process to reach better performance. Secondly, the test dataset (the remaining part of the 
dataset) is used to validate the accuracy of the model. 

— 3rd Phase: Evaluating the model by prediction; after building and generating the model, we evaluate 
this model with the test dataset by predicting attacks and calculating the time needed for this prediction. 


4.1. Dataset preprocessing 

For conducting proposed work, we have used the latest Bot-IoT dataset [32] that was created 
specifically for loT systems by an actual network milieu at the Cyber Range Lab of the Center of UNSW 
Canberra Cyber. The environment incorporates a combination of usual normal and bad traffic, with six types 
of attacks and 10 subcategories, namely, reconnaissance (service scanning and OS fingerprinting), DDoS 
(TCP, UDP and HTTP), DoS (TCP, UDP and HTTP), theft (key logging and data exfiltration). 

With 72 million records of data traffic simulated IoT environment. The whole data was ascended 
down to 5% into a “full-feature” dataset with around 3.6 million records and another version called “10 best 
features” is also provided with selection of best features from the “Full features” version, both versions are 
used for our experiment. The training and test dataset have 11 output classes which reflect the normal traffic, 
and the 10 types of attacks which were carried out against the oT network. 

The Bot-IOT dataset contains network connection attributes; nominal, numeric and IP addresses. 
We convert the nominal data to numeric data, ipv4 and ipv6 also be converted to numerical shape, and 
merging category and subcategories fields into one field that contains 10 types of attacks and the 11th is a 
normal traffic then we convert the new category attribute using “one-hot encoding”, and dropped the binary 
“Attack” field cause. Our focus is on a multilabel output and not a binary one. 

After encoding the data, we normalized it using Scikit-learn; meaning scaling the vectors 
individually to unit norm, and converting the normalized output data into image data shape. 

Then we split the data at first into data X (contains all the features except the “category” feature) 
and label Y (contains the “category” feature), and then split it into random training subset and testing subset 
with 75% for training set and 25% for the testing set. Figure 7 shows the number of data rows for each set 
and each attack type. 
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Figure 7. Attacks distribution in training and testing sets 


4.2. Building our models 

The CNN models were defined to have an input layer with the number of neurons equal to the 
amount of input features, four hidden layers Convolution2D layer, MaxPooling2D layer, Flatten layer, Dense 
layer and an output layer. For the best features dataset, the model was trained in 10 epochs (the whole dataset 
is passed through the neural network 10 times) with batch size of 32 (amount of training simples in a single 
batch is 32) and a kernel size of (1, 10). The neural network comprises 16 input neurons (in the first layer, the 
same number as the features), with 4 intermediate (hidden) layers with 32 (Convolution2D), 32 
(MaxPooling2D), 480 (Flatten), 22 (Dense) neurons, and 11 output neurons for the multilabel classification 
as shown in Figure 8. For our full-feature dataset, the model was trained in 15 epochs (batch size of 32 and a 
kernel size of (1, 10)), and had a 40-neuron input layer, same number and consistency of hidden layers as 
with the best feature model and 11 output neurons for the multilabel classification. In both cases, for the input 
and hidden layers, the activation function that was used was ‘Relu’, while the output layer activation function 
was ‘Softmax’ as shown in Figure 8 and Table 1. 

The other recurrent neural network (RNN) models were defined to have one input layer with the 
number of neurons equal to the amount of input features, four hidden layers and an output layer. For the best 
features dataset, the model was trained in 40 epochs with 32 in the batch size. The neural network comprised 
16 input neurons (in the first layer, the same number as the features), with 4 (SimpleRNN/LSTM/GRU) 
intermediate (hidden) layers with 32, 64, 128, 22 neurons, and 11 output neurons for the multilabel 
classification as shown in Figure 9. For our full-feature dataset, the model was trained in 40 epochs (32 in the 
batch size), and had a 40-neuron input layer, same number and consistency of hidden layers as with the best 
feature model and 11 output neurons for the multilabel classification. In both cases, for the output layer 
activation function was ‘Softmax’ as shown in Figure 9, and Table 1. 
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Table 1. Models parameters for the implemented models 


DL Convolutional neural network Simple recurrent neural Long short-term memory Gated recurrent unit 
Algorithm (CNN) network (RNN) (LSTM) based RNN (GRU) based RNN 
Dataset Full Best Full Best Full Best 
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version features features features features features features 
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function Softmax 
Optimizer adam adam adam adam 
Batch size 32 32 32 32 


5. RESULTS AND DISCUSSION 
5.1. Hardware characteristics 

The results we obtained are performed on a machine (laptop) with following hardware 
characteristics. 
— CPU: Intel i7 8th generation (1 socket, 6 cores, 12 threads) 
— RAM: 8 GB 
— GPU: NVidia GeForce GTX 1050 with cuda v10.1 

In our experiments we worked with Keras (2.2.4); an open-source python deep learning library 
which is running on top of Google’s open-source data flow software; TensorFlow (1.13.1) as a backend 
engine. 


5.2. Evaluating the model 

In Figures 10-11 (generated by Tensorboard) and Table 2, we present the accuracy training, loss 
training, accuracy validation, loss validation along with training time of all models that are trained over 
several epochs. We consider the number of epochs for which our model reaches the best results for each 
version of the dataset (full features and best features). We initially tested each model with a batch size of 128, 
which allowed us to get good timing for all models but with a poor performance, compared to a smaller batch 
size like 32 records that leads to an improved result but with a higher computation time. 

As shown in Figure 10 and Table 2, CNN models in both dataset versions “Full Features” and “Best 
Features” reaches respectively 99,430 and 99,935 as accuracy in just 1395s and 823s (15 and 10 epochs for 
each). Compared to our CNN, GRU is a little more accurate but it took more time to be trained (12412s and 
5818s in 40 and 20 epochs). The time of our CNN is almost twice compared to simple RNN for the “Full 
Features” version (6089s in 40 epochs), and almost equal to the “Best Features” version (5708s in 40 
epochs). On the other hand, LSTM models have their best accuracy in a timing between simple RNN and 
GRU (with 10627s and 8302s in 30 and 15 epochs). For the Loss as shown in Figure 11 and Table 2, CNN 
models are always the best reaching models in both dataset versions, “Full Features” and “Best Features”, 
with respectively 0,582% and 1,663% compared to other RNN models (between 1.7% and 2.9%). By 
examining these results, it is clear that the CNN model using “Full Features” is the best model according to 
its higher accuracy (0.9993), and the best time performance when considering the whole time for obtaining 
these results. In addition, when comparing the resulting metrics in all models with the two-dataset versions, 
we clearly see that the “Full Features” version gets best performance related to “Best Features” version. This 
means that deep learning algorithms can achieve better classification and learn additional features with 
additional traffic data as already explained in the previous section, but in computation time and energy 
consumption, it consumes more because more data to train are needed. 

In Figure 12, we compared the time that a prediction would take for all the considered models (how 
much time the IDS would take to identify an attack) which could be another important metric for the IDS 
implementation in a real IoT environment. We then concluded that CNN has not only the best accuracy and 
the lowest loss, but also the lowest time to predict an attack with just a fragment of milliseconds (with an 
average of 0.34ms). The other models make a detection of an attack in more than twice or triple that time of 
CNN (0.78ms for simple RNN, 1.03ms for GRU, and 1.08ms for LSTM). 
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The amount of data in the dataset is considerably unbalanced regarding the different types of 
attacks. For example, DoS (HTTP), DDoS (HTTP), key logging, and data exfiltration attacks in the training 
and test sets have very small amount of data as shown in Figure 7. Therefore, the model has a limited 
capability to learn accurately these types of attacks. The detection average of these attacks is one of the main 
factors restricting the overall detection accuracy. 
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Table 2. Training and testing results for the implemented models 
DL Convolutional Neural Simple RNN (Recurrent Long Short-Term Memory Gated Recurrent Unit (GRU) 
Algorithm Network (CNN) Neural Network) (LSTM) based RNN based RNN 
Dataset Full Best Full Best Full Best Full Best 
version Features Features Features Features Features Features Features Features 
Epochs 15 10 40 40 30 25 40 20 
Layers 4 4 4 4 4 4 4 4 
pera 0.99762 0.99086 —-0.98167 0.98038 0.98395 0.98285 0.98787 0.98301 
ee 0.01132 0.03294 ~—S-0,07101 0.07115 0.06155 0.06526 0.04508 0.06661 
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Accuracy 9.99935 0.99430 -—«0.99406 0.99427 0.99401 0.99747 0.99403 0.99430 
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Figure 12. Prediction timing comparison 


6. CONCLUSION 

The number of IoT objects dispatched around the world is growing progressively, which makes its 
security a big challenge. Our work here focused on intrusion detection systems for IoT using four variants of 
deep learning models, and compared them efficiently to detect various types of IoT network attacks, usually 
done by botnets. After several experiments, we obtained a reasonable detection rate on our all four models. 
By analyzing the obtained results, we concluded that CNN is the best one for intrusion detection systems, it 
was able to identify successfully different types of attacks and showed the higher accuracy (with 99.94%) in 
comparison with other DL algorithms, such as Simple RNN, LSTM, and GRU, and it allowed a detection 
with lower loss rates (0.58%) and a better performance in terms of prediction time. As future works, we will 
apply the CNN model on a real network traffic data, and reinforcing it using deep transfer learning by adding 
other characteristics from other datasets or from diverse firewalls, logs, and IDS servers. In addition, we will 
try to use self-supervised learning to generate a powerful and updated model, and then implement an 
autonomous intrusion detection system. 
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